Unsupervised SMT for a low-resourced language pair
نویسندگان
چکیده
This paper presents an unsupervised method in application of extracting parallel sentence pairs from a comparable corpus. A translation system is used to mine the comparable corpus and to withdraw the parallel sentence pairs. An iteration process is implemented not only to increase the number of extracted parallel sentence pairs but also to improve the quality of translation system. A comparison between this unsupervised method and a semi-supervised method is also presented. The unsupervised extracting method was tested in a hard condition: the parallel corpus did not exist and the comparable corpus contained up to 50% of non parallel sentence pairs. However, the result shows that the unsupervised method can be really applied in the case of lacking parallel data.
منابع مشابه
SMT-based ASR domain adaptation methods for under-resourced languages: Application to Romanian
This study investigates the possibility of using statistical machine translation to create domainspecific language resources. We propose a methodology that aims to create a domain-specific automatic speech recognition (ASR) system for a low-resourced language when in-domain text corpora are available only in a high-resourced language. Several translation scenarios (both unsupervised and semi-su...
متن کاملEmploying Pivot Language Technique through Statistical and Neural Machine Translation Frameworks: the Case of Under-resourced Persian-spanish Language Pair
The quality of Neural Machine Translation (NMT) systems like Statistical Machine Translation (SMT) systems, heavily depends on the size of training data set, while for some pairs of languages, high-quality parallel data are poor resources. In order to respond to this low-resourced training data bottleneck reality, we employ the pivoting approach in both neural MT and statistical MT frameworks. ...
متن کاملTmTriangulate: A Tool for Phrase Table Triangulation
This work was supported by the grants no 645452 (QT21) and no 644402 (HimL) of the EU and SVV 260 104 of the Czech Republic. We used language resources hosted by the LINDAT/CLARIN project LM2010013 of the Ministry of Education, Youth and Sports. Introduction Under-resourced language pair: Scarcity of parallel corpora SMT Problem: No direct data → no SMT training Insufficient data → poor SMT per...
متن کاملExtraction de corpus parallèle pour la traduction automatique depuis et vers une langue peu dotée. (Extraction a parallel corpus for machine translation from and to under-resourced languages)
Nowadays, machine translation has reached good results when applied to several language pairs such as English – French, English – Chinese, English – Spanish, etc. Empirical translation, particularly statistical machine translation allows us to build quickly a translation system if adequate data is available because statistical machine translation is based on models trained from large parallel b...
متن کاملA Hybrid Machine Translation System Based on a Monotone Decoder
In this paper, a hybrid Machine Translation (MT) system is proposed by combining the result of a rule-based machine translation (RBMT) system with a statistical approach. The RBMT uses a set of linguistic rules for translation, which leads to better translation results in terms of word ordering and syntactic structure. On the other hand, SMT works better in lexical choice. Therefore, in our sys...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2010